{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 使用numpy中的 linalg来进行svd分解\n",
    "svd分解： $$ Data_{m \\times n} = U_{m \\times m} \\Sigma_{m \\times n} V^T_{n \\times n} $$\n",
    "其中 $ U_{m \\times m} $ 中为左奇异向量，前 $ r $ 个即为所求， $ \\Sigma_{m \\times n} $ 中对角线上值为奇异值，其余值为0，一般对角线上的值按从大到小排序，且前 $ r $ 个奇异值之和占据整个奇异值之和的大部分，$ r $ 个之后的可置为0，视为噪声或冗余特征。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "U: [[-0.14142136 -0.98994949]\n",
      " [-0.98994949  0.14142136]]\n",
      "Sigma: [1.00000000e+01 2.82797782e-16]\n",
      "VT: [[-0.70710678 -0.70710678]\n",
      " [ 0.70710678 -0.70710678]]\n"
     ]
    }
   ],
   "source": [
    "U, Sigma, VT = np.linalg.svd([[1,1],[7,7]])\n",
    "print('U:', U)\n",
    "print('Sigma:', Sigma)\n",
    "print('VT:', VT)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "def loadExData():\n",
    "    '''加载数据'''\n",
    "    return [[1,1,1,0,0],\n",
    "            [2,2,2,0,0],\n",
    "            [1,1,1,0,0],\n",
    "            [5,5,5,0,0],\n",
    "            [1,1,0,2,2],\n",
    "            [0,0,0,3,3],\n",
    "            [0,0,0,1,1]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "array([9.72140007e+00, 5.29397912e+00, 6.84226362e-01, 4.11502614e-16,\n",
       "       1.36030206e-16])"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Data = loadExData()\n",
    "U, Sigma, VT = np.linalg.svd(Data)\n",
    "Sigma"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "可以看到奇异值中前3个数值较大，而后两个太小。于是取前三个近似表示数据集矩阵 $ Data $:\n",
    "$$ Data_{m \\times n} \\approx U_{m \\times 3} \\Sigma_{3 \\times 3} V^T_{3 \\times r} $$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "尝试重构原始矩阵:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "matrix([[ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,\n",
       "          7.75989921e-16,  7.71587483e-16],\n",
       "        [ 2.00000000e+00,  2.00000000e+00,  2.00000000e+00,\n",
       "          3.00514919e-16,  2.77832253e-16],\n",
       "        [ 1.00000000e+00,  1.00000000e+00,  1.00000000e+00,\n",
       "          2.18975112e-16,  2.07633779e-16],\n",
       "        [ 5.00000000e+00,  5.00000000e+00,  5.00000000e+00,\n",
       "          3.00675663e-17, -1.28697294e-17],\n",
       "        [ 1.00000000e+00,  1.00000000e+00, -5.48397422e-16,\n",
       "          2.00000000e+00,  2.00000000e+00],\n",
       "        [ 3.21319929e-16,  4.43562065e-16, -3.48967188e-16,\n",
       "          3.00000000e+00,  3.00000000e+00],\n",
       "        [ 9.71445147e-17,  1.45716772e-16, -1.52655666e-16,\n",
       "          1.00000000e+00,  1.00000000e+00]])"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "Sig3 = np.mat([[Sigma[0],0,0],[0,Sigma[1],0],[0,0,Sigma[2]]])\n",
    "Data_s = U[:,:3] * Sig3 * VT[:3,:]\n",
    "Data_s"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 小例子：基于协同过滤的推荐引擎\n",
    "利用用户的评价来进行物品之间相似度的计算。<br>\n",
    "评价矩阵: $$ \\begin{bmatrix} x_1^{(1)} & x_2^{(1)} & \\cdots & x_n^{(1)} \\\\ x_1^{(2)} & x_2^{(2)} & \\cdots & x_n^{(2)} \\\\ \\vdots & \\vdots & \\ddots & \\vdots \\\\ x_1^{(m)} & x_2^{(m)} & \\cdots & x_n^{(m)} \\\\ \\end{bmatrix}$$\n",
    "其中 $ m $ 表示用户数， $ n $ 表示物品数\n",
    "（1）欧氏距离(物品 $ x_i 与 x_j $ )\n",
    "$$ dist = \\sqrt{(x_i^{(1)} - x_j^{(1)})^2 + x_i^{(2)} - x_j^{(2)})^2 + ... + x_i^{(m)} - x_j^{(m)})^2} $$ \n",
    "（2）皮尔逊相关系数\n",
    "使用numpy中的numpy.corrcoef()来计算，由于皮尔逊系数的取值在 $ [-1,1] $ 之间，因此用 $ 0.5 + 0.5*np.corrcoef() $ 将值映射到 $ [0,1] $\n",
    "（3）余弦相似度\n",
    "$$ \\cos \\theta = \\frac{A \\cdot B}{|| A || || B ||} $$\n",
    "余弦相似度取值范围也是 $ [-1,1] $ ，因此使用2范数将值映射到 $ [0,1] $ 。 $ 向量[4,2,2]的 2范数=\\sqrt{4^2 + 2^2 + 2^2} $"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "def euclidSim(inA, inB):\n",
    "    '''欧氏距离'''\n",
    "    return 1.0 / (1.0 + np.linalg.norm(inA - inB))\n",
    "\n",
    "def pearsSim(inA, inB):\n",
    "    '''皮尔逊相关系数'''\n",
    "    # 二维或一维情况下，点A与点B可以确定一条直线\n",
    "    if len(inA) < 3 : return 1.0\n",
    "    print(np.corrcoef(inA, inB, rowvar = 0))\n",
    "    return 0.5 + 0.5 * np.corrcoef(inA, inB, rowvar = 0)[0][1]\n",
    "\n",
    "def cosSim(inA, inB):\n",
    "    '''余弦相似度'''\n",
    "    num = np.float(inA.T * inB)\n",
    "    denom = np.linalg.norm(inA) * np.linalg.norm(inB)\n",
    "    return 0.5 + 0.5 * (num / denom)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "np.corrcoef?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
